fix(custom/orfnormalise): treat ribotish GenomePos start as 0-based#12173
fix(custom/orfnormalise): treat ribotish GenomePos start as 0-based#12173pinin4fjords wants to merge 3 commits into
Conversation
ribotish `predict` emits 0-based half-open `GenomePos` coordinates, but the parser subtracted 1 from the start, treating it as 1-based. That shifted every ribotish-derived ORF 5' by one base: a reading-frame shift on + strand records (premature in-frame stops, and frame-2 codon positions in any downstream codon-aware step) and a 3' overhang on - strand. Verified against the genome - the corrected start lands on the ATG and translates to a clean full-length ORF. Also: - migrate the test fixtures off the deleted `pinin4fjords/test-datasets` `add-orf-prediction-fixtures` branch to `modules_testdata_base_path` (nf-core/test-datasets), which the test now depends on to run at all. - add a codon-boundary regression guard asserting every emitted ORF spans a whole number of codons, which catches this class of coordinate bug for all callers. - regenerate the orfnormalise and orftable_fasta_gtf_buildorfcatalogue snapshots (only the ribotish-derived records change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5243c7b to
8aa7c1f
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a coordinate-parsing bug in custom/orfnormalise for ribotish predict output by treating GenomePos starts as 0-based (half-open) instead of incorrectly converting them as if they were 1-based. This aligns ribotish-derived ORFs with correct frame/codon boundaries and updates tests/snapshots accordingly.
Changes:
- Correct
_parse_ribotish_genposto keep the ribotishGenomePosstart coordinate as-is (0-based half-open). - Update module tests to fetch fixtures via
params.modules_testdata_base_pathand add a codon-span regression guard (sum(blockSizes) % 3 == 0) across caller fixtures. - Regenerate nf-test snapshots for
custom/orfnormaliseand the dependent subworkflow.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
modules/nf-core/custom/orfnormalise/templates/orfnormalise.py |
Fix ribotish GenomePos start coordinate handling to be 0-based half-open. |
modules/nf-core/custom/orfnormalise/tests/main.nf.test |
Switch fixtures to shared test-datasets base path and add codon-span regression assertions. |
modules/nf-core/custom/orfnormalise/tests/main.nf.test.snap |
Update expected outputs for ribotish scenario and metadata ordering/timestamps. |
subworkflows/nf-core/orftable_fasta_gtf_buildorfcatalogue/tests/main.nf.test.snap |
Update expected catalogue outputs/snapshots reflecting corrected ribotish-derived records. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
suhrig
left a comment
There was a problem hiding this comment.
Thank you for spotting/fixing this!
The %3 assertion is a great idea!
| continue | ||
| if b_i < a_i: | ||
| a_i, b_i = b_i, a_i | ||
| out.append((a_i - 1, b_i)) |
There was a problem hiding this comment.
Do we need to omit - 1 here as well? I haven't confirmed empirically, but I'm assuming the blocks column of Ribotish is 0-based, too.
Caution: This function is also used for parsing of Ribotricer output, which may be 1-based.
There was a problem hiding this comment.
Good catch, and you're right to be wary of parse_intervals being shared.
Checked it empirically: ribotish predict emits 19 columns (Gid, Tid, Symbol, GeneType, GenomePos, StartCodon, Start, Stop, TisType, TISGroup, TISCounts, TISPvalue, RiboPvalue, RiboPStatus, FisherPvalue, TISQvalue, FrameQvalue, FisherQvalue, AALen) and no Blocks column, so parse_intervals(row.get("Blocks", ...)) never actually fired - exon structure is always recovered from the GenomePos span intersected with the transcript exons. A 4-exon ribotish ORF in a full pipeline run reconstructs and translates cleanly through that fallback after the fix, so the GenomePos correction is the whole story for ribotish. I've dropped the dead Blocks branch so ribotish always uses that verified path (output unchanged, snapshots still match).
And yes, parse_intervals stays 1-based on purpose: ribotricer is 1-based (its fixtures translate to clean ATG-start, no-internal-stop, len % 3 == 0 ORFs through that -1), so removing it there would have broken ribotricer. Leaving it alone was deliberate.
ribotish `predict` reports a single genomic span (GenomePos) and no per-exon
`Blocks` column, so `parse_intervals(row.get("Blocks", ...))` never fired -
exon structure is always recovered from the GenomePos span intersected with
the transcript's exons. Remove the dead branch so ribotish always uses that
verified path. This also removes a latent foot-gun: `parse_intervals` is
1-based (correct for ribotricer), so had ribotish ever emitted a 0-based
`Blocks` column it would have reintroduced the off-by-one this PR fixes.
Output is unchanged (the branch was dead); snapshots match without regen.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
custom/orfnormalisemis-parses ribotishpredictcoordinates. ribotish emits 0-based half-openGenomePos(andStart/Stop) coordinates, but_parse_ribotish_genpossubtracted 1 from the start, treating it as 1-based. This shifts every ribotish-derived ORF 5' by one base:thickStart/blocks no longer start on the ATG, so any downstream codon-aware consumer (e.g. expanding the catalogue into in-frame codon positions for P-site quantification) walks frame 2, and the catalogue AA FASTA translates with premature in-frame stops.Verification
Against the chr20 reference for a worked ribotish ORF (
GenomePos=20:326098-326872:+,AALen=257):326098→ codonATG, clean translation to 257 aa then stop at codon 258 (matchesAALen);-1start326097→ codonCAT, premature stop at codon 72.After the fix, all five caller fixtures (ribocode, ribotish, ribotricer, rpbp, price) emit ORFs whose blockSizes sum is a multiple of 3, and every chr20 ORF starts on ATG with no internal stop codons. Only ribotish was affected; the other callers were already correct.
Changes
_parse_ribotish_genposuses theGenomePosstart as-is (the end was already treated as 0-based-exclusive).pinin4fjords/test-datasets@add-orf-prediction-fixturesbranch toparams.modules_testdata_base_path(nf-core/test-datasets). The test could not run as-is (dead URLs → 404).sum(blockSizes) % 3 == 0), which catches this class of coordinate/frame bug for all callers.custom/orfnormaliseand the dependentorftable_fasta_gtf_buildorfcataloguesubworkflow snapshots. Only the ribotish-derived records change.PR checklist
nf-test test modules/nf-core/custom/orfnormalise/... subworkflows/nf-core/orftable_fasta_gtf_buildorfcatalogue/..., NXF 25.04.8,--profile docker).🤖 Generated with Claude Code